

AD-A282  285 


REPORT DOCUMENTATION PAGE


Form Approved
OMB No. 0704-0188


1. AGENCY USE ONLY (Leave Blank)


4. TITLE AND SUBTITLE


2.  REPORT  DATE 

November  1993 


Example  Based  Image  Analysis  and  Synthesis 


6. AUTHOR(S)

David  Beymer,  Amnon  Shashua,  and  Tomaso  Poggio 


3.  REPORT  TYPE  AND  DATES  COVERED 

memorandum 


5.  FUNDING  NUMBERS 

N00014-91-J-1270 

N00014-92-J-1879 

ASC-9217041 

N00014-92-J-4038 


7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)

Massachusetts  Institute  of  Technology 
Artificial  Intelligence  Laboratory 
545  Technology  Square 
Cambridge,  Massachusetts  02139 


9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)

Office  of  Naval  Research 
Information  Systems 
Arlington,  Virginia  22217 




8.  PERFORMING  ORGANIZATION 
REPORT  NUMBER 


AIM  1431 
CBCL  80 


10. SPONSORING/MONITORING
AGENCY  REPORT  NUMBER 


12a. DISTRIBUTION/AVAILABILITY STATEMENT


12b.  DISTRIBUTION  CODE 


DISTRIBUTION  UNLIMITED 


13.  ABSTRACT  (Maximum  200  words) 

Image analysis and graphics synthesis can be achieved with learning techniques that use image
examples directly, without physically-based 3D models. In our technique:

- the mapping from novel images to a vector of "pose" and "expression" parameters can be learned
from a small set of example images using a function approximation technique that we call an analysis
network;

- the inverse mapping from input "pose" and "expression" parameters to output images can be
synthesized from a small set of example images and used to produce new images using a similar synthesis
network.

The techniques described here have several applications in computer graphics, special effects, interactive
multimedia and very low bandwidth teleconferencing.


14.  SUBJECT  TERMS 

computer graphics
computer vision
image compression
networks
teleconferencing
computer interfaces


15.  NUMBER  OF  PAGES 
21 


16.  PRICE  CODE 


17. SECURITY CLASSIFICATION OF REPORT: UNCLASSIFIED
18. SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED
19. SECURITY CLASSIFICATION OF ABSTRACT: UNCLASSIFIED
20. LIMITATION OF ABSTRACT: UNCLASSIFIED


MASSACHUSETTS  INSTITUTE  OF  TECHNOLOGY 
ARTIFICIAL  INTELLIGENCE  LABORATORY 




A.I. Memo No. 1431                                   November, 1993
C.B.C.L. Paper No. 80

Example Based Image Analysis and Synthesis
D. Beymer, A. Shashua and T. Poggio

Abstract


Image analysis and graphics synthesis can be achieved with learning techniques that use image
examples directly, without physically-based 3D models. We describe here novel techniques for the analysis and
the synthesis of new grey-level (and color) images. With the first technique,



•  the mapping from novel images to a vector of “pose” and “expression” parameters can be learned
from a small set of example images using a function approximation technique that we call an analysis
network;

•  the inverse mapping from input “pose” and “expression” parameters to output grey-level images can
be synthesized from a small set of example images and used to produce new images under real-time
control using a similar learning network, called in this case a synthesis network.


This  technique  relies  on  (i)  using  a  correspondence  algorithm  that  matches  corresponding  pixels  among 
pairs  of  grey-level  images  and  effectively  “vectorizes”  them,  and  (ii)  exploiting  a  class  of  multidimensional 
interpolation  networks  -  Regularization  Networks  -  that  approximate  the  nonlinear  mapping  between  the 
vector  input  and  the  vector  output. 

We also describe a second technique for analysis and synthesis of images, which can be formulated within
the same theoretical framework, and discuss the somewhat different implementation tradeoffs in terms of
memory and computation.

As  a  third  contribution,  we  introduce  an  approach  for  generating  novel  grey-level  images  from  a  single 
example  image  by  learning  the  desired  transformation  from  prototypical  examples. 

The three techniques described here have several applications in computer graphics, special effects,
interactive multimedia and object recognition systems. The analysis network can be regarded as a passive
and trainable universal interface, that is, a control device which may be used as a generalized computer
mouse, instead of “gloves”, “body suits” and joysticks. The synthesis network is an unconventional and
novel approach to computer graphics. The techniques described here can be used for very low bandwidth
teleconferencing and interactive simulations.
1  Introduction

The classical synthesis problem of computer graphics is
the problem of generating novel images corresponding
to an appropriate set of “pose” control parameters. The
inverse analysis problem - of estimating pose and expression
parameters from images - is the classical problem
of computer vision. In the last decade both fields
of research have approached their respective problems
of analysis and synthesis using intermediate physically-based
models. Computer graphics has developed sophisticated
3D models and rendering techniques - effectively
simulating the physics of rigid and non-rigid solid bodies
and the physics of imaging. Part of computer vision has
followed a parallel path: most object recognition algorithms
use 3D object models and exploit the properties
of geometrical and physical optics to match images to
the database of models.

In a series of recent papers we have advocated a different,
simpler approach which bypasses the use of physical,
3D models (though it may be used with 3D views, see
later). The approach has advantages and disadvantages.
Relative to the traditional approach, its main feature is to
trade off computation against memory. In one metaphor that
aptly describes it, we speak of learning the mapping -
from images to pose or from pose to images - from a
set of examples, that is, pairs of images and associated
pose parameters. The learned mapping is then used to
estimate the pose for a novel image of the same type,
whereas the inverse mapping can provide a novel image
for a new value of the control parameters.

1.1  Background  of  the  General  Approach 

When we first introduced our technique based on Regularization
Networks (Poggio and Girosi, 1990) to “solve”
the analysis problem - from images to “pose” parameters
- we demonstrated it only for artificial, very simple images
(see Poggio and Edelman, 1990). On the synthesis
problem we were able to demonstrate that our approach
is effective for storing, interpolating among, and even
extrapolating from professional quality, hand-drawn and
colored illustrations (see Poggio and Brunelli, 1992;
Librande, 1992). In the learning metaphor, the computer
“learns” to draw a class of objects from a small set of examples
provided by the artist, under user control. Each
of the objects modeled (e.g., eyes, mouths, aircraft silhouettes,
and the head of the cartoon character Garfield)
was learned from only a few digitized drawings. The
sets of training images were digitized either directly, using
a pen and tablet, or else from printed media using
an ordinary flat-bed scanner. The 2D control points
used in the internal representations of these objects were
then obtained semi-automatically using recently developed
algorithms for automatic contour tracing (Lines,
pers. comm.) and contour matching from a small number
of manually matched points (Librande, 1992). Poggio
and Brunelli (1992) had suggested ways to extend the
approach from line drawings to grey-level images and
demonstrated the extension in a simple case of video
images of a walking person. Their approach relied on
manual correspondence established between a few key
points in the example images (Chen et al. 1993 used the
Poggio-Brunelli approach by exploiting the correspondence
provided by the range data associated with their
color images).

The more traditional approach to the analysis and
especially the synthesis problem is based on explicit 3D
models of the objects (Aizawa, Harashima and Saito,
1989; Nakaya et al. 1991; Choi et al. 1991; Li et al.
1993; Terzopoulos & Waters 1993; Oka et al. 1987; see
Poggio, 1991). Parameters of a model of a generic face
can be estimated from one or more images and then used
to generate an accurate 3D model of the particular face,
which can then in turn be used to generate new images
of the face. Such an approach is being developed mainly
for very low bandwidth teleconferencing, an application
that we are also going to describe later in this paper.

1.2  Contributions  and  plan  of  the  paper 

In this paper we show how to extend our previous approach
from line drawings and computer-simulated objects
to grey-level and color images in a completely automatic
way, describe some experiments with face images,
discuss related theoretical results and mention various
applications of the technique. The detailed plan of the
paper is as follows. We introduce first our notation for
computing and representing images as vectors. We then
describe the analysis network and show examples of its
performance. The synthesis network is introduced next
with experimental examples of the synthesis of face images.
Section 5 then describes several theoretical aspects
of our approach. First, we summarize the Regularization
Networks technique that we use for learning-from-examples;
second, we introduce solutions to the correspondence
problem between images, which represent a key
step for automating the whole process. Section 5 also
describes how linear combination of images - in the appropriate
representation - is intimately related to
Regularization Networks but can also be justified in independent
ways. Section 6 describes a new technique
for generating novel views from a single image of, say,
a face without any 3D model but simply by “learning”
the appropriate transformations from example views of
a prototype face, extending a technique originally suggested
by Poggio and Brunelli (see also Poggio, 1991 and
Poggio and Vetter, 1992). Notice that although all our
examples use face images, our techniques can be applied
to images of any rigid or flexible object or collection of
objects to deal with transformations of viewpoint, shape,
color and texture. In the final section we discuss some of
the many applications. The appendices provide details
on theoretical and performance aspects.

2  Notation 

Our approach (see for instance Poggio and Brunelli,
1992) is to represent images in terms of a vector of the x,
y coordinates of corresponding feature locations. These features can
run the gamut from sparse features with semantic meaning,
such as the corners of the eyes and mouth, to pixel
level features that are defined by the local grey level
structure of the image. In this paper we follow the latter
approach. Images are vectorized by first choosing a
reference image and then finding pixel level correspondence
between the reference image and the image to be
vectorized, using, for example, one of the
optical flow or stereo algorithms developed in computer
vision. Grey-level or color images are synthesized from
the vectorized representation by using simple 2D image
warping operations.

Vectorization of images at the pixel level requires the
solution of a difficult correspondence problem. We found,
however (see also Shashua 1992b; Jones and Poggio, in
preparation), that several correspondence algorithms perform
satisfactorily for finding pixel level correspondence
between grey level images of faces that are not too different.
We will now introduce two operators for converting
back and forth between the image representation and the
pixel level vectorized representation.

Let $img$ be an image that we want to convert to vector
representation $\mathbf{y}$. In the pixelwise representation we
are using, $\mathbf{y}$ represents pixelwise correspondences from
some reference image $img_{ref}$ to $img$. The vect operator
computes these pixelwise correspondences, in our case
using one of the optical flow algorithms:

$$\mathbf{y} = \mathrm{vect}(img, img_{ref}).$$

In our examples, $img_{ref}$ might be a frontal, expressionless
view of the person's face, and $img$ will be a rotated
view or a view with facial expression. Once images
that represent different transformations from the reference
image have been vectorized, new faces that combine
these transformations can be synthesized using our
learning techniques, which are of course equivalent to
multivariate approximation schemes.

After computations in vectorized image space, one
usually ends up with a vectorized image that one would
like to view as an image. Borrowing from the computer
graphics nomenclature, we call this process rendering a
vectorized image. Because a vectorized image $\mathbf{y}$ simply
characterizes feature geometry, to render it we need
to sample the image texture from an example image,
$img_{tex}$. The rend operator synthesizes an image with
the feature geometry of $\mathbf{y}$ and the texture of $img_{tex}$:

$$img = \mathrm{rend}(\mathbf{y}, img_{tex}, \mathbf{y}_{tex}),$$

where $\mathbf{y}_{tex}$ is the vectorized version of $img_{tex}$. If $img_{tex}$
is the same as the reference image $img_{ref}$, then implementing
the rend operator is simple. In this case, a 2D
warp is applied to $img_{tex}$, pushing the pixels in $img_{tex}$
along the flow vectors in $\mathbf{y}$. However, in the general case
where $img_{tex}$ and $img_{ref}$ may be different, $\mathbf{y}$ must first
be transformed to change the reference image to $img_{tex}$.
How to perform this change in reference frame is explained
in Appendix D.
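
The simple case of the rend operator, in which the texture image is also the reference image, can be sketched as a forward warp. The following is a minimal illustration and not the implementation used in the experiments: the function name `rend_forward`, the nested-list image layout, the `(dx, dy)` flow convention and the nearest-pixel rounding are all assumptions made for the example.

```python
def rend_forward(flow, img_ref, fill=0):
    """Forward-warp img_ref by pushing each pixel along its flow vector.

    flow[r][c] = (dx, dy) is the displacement of reference pixel (r, c)
    (assumed convention). Targets are rounded to the nearest pixel,
    targets outside the image are dropped, and pixels that receive no
    value keep `fill`.
    """
    rows, cols = len(img_ref), len(img_ref[0])
    out = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dx, dy = flow[r][c]
            tr, tc = round(r + dy), round(c + dx)
            if 0 <= tr < rows and 0 <= tc < cols:
                out[tr][tc] = img_ref[r][c]
    return out


# A zero flow field leaves the reference image unchanged.
img = [[1, 2, 0], [0, 0, 0]]
zero_flow = [[(0, 0)] * 3 for _ in range(2)]
same = rend_forward(zero_flow, img)
```

A forward warp of this kind can leave holes where no source pixel lands and can overwrite pixels where several land; a practical renderer would need some form of hole filling, which this sketch omits.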

The vect and rend operators will be used in the following
sections where we show the application of vectorized
images in example-based image synthesis and analysis.

3  The  Analysis  Network:  learning  to 
estimate  expression  and  pose 
parameters  from  grey  level  images 

Consider the problem of learning to estimate the pose
and expression parameters (including parameters associated
with color or texture) of novel input images given
a set of example images of a similar type. An analysis
network which successfully estimates pose and expression
parameters from images may be used as a trainable
interface for a number of tasks, such as to drive
an example-based synthesis network. As we will discuss
later, the “pose” parameters estimated by the network of
this section could be used to encode information in teleconferencing
applications and other image compression
tasks. Another application is actor controlled animation,
where an analysis module has examples of the (human)
actor and works as a passive “body suit”. The analysis
parameters may be used to “direct” another person, who
is represented by examples at the synthesis module.

The idea underlying the analysis network is to synthesize
the mapping between a grey-level or color image and
the associated “pose” parameters by using a suitable set
of examples - pairs of images and associated pose parameters
- and a Regularization Network for “learning” from
them. The network, shown in figure 1 and explained
more fully in a later section, “learns” the mapping from
inputs to outputs using pairs of input/output vectors
$(\mathbf{x}_i, \mathbf{y}_i)$ as examples. In this section, the outputs $\mathbf{y}_i$ are
estimates of the location in pose/expression space of the
input examples, which are vectorized images $\mathbf{x}_i$. When
presented with a new example image $\mathbf{x}$, the “trained” network
will produce an estimate $\mathbf{y}$ of the pose/expression
parameters. This example-based approach to estimating
pose parameters for simple 3D objects was first suggested
and demonstrated by Poggio and Edelman (1990,
see figure 4). The input images were vectorized by establishing
the correspondence between feature points. As
mentioned earlier, we have now used a correspondence
technique that effectively vectorizes the example images
- as well as the novel input image - by establishing pixelwise
correspondence among them.

3.1  An  example:  estimating  expression  and 
pose  for  face  images 

We  will  describe  here  a  specific  experiment  in  which  a 
Gaussian  Radial  Basis  Function  network  is  trained  to 
estimate  face  rotation  and  degree  of  smiling  from  a  set 
of  four  examples. 

The first step is to vectorize the four examples by computing
pixelwise correspondence between them. The
steps involved are:

1.  Two  anchor  points  such  as  the  two  eyes,  in  the  case 
of  faces,  are  matched  manually  or  automatically. 

2.  The  images  are  scaled  and  rotated  using  the  image- 
plane  transformation  determined  by  the  anchor 
points. 

3.  Pixelwise  correspondence  is  found  using  an  optical 
flow  algorithm,  such  as  the  one  described  later  in 
this  paper. 
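
The image-plane transformation of step 2 can be sketched as a similarity transform fixed by the two anchor points. The sketch below is an assumption-laden illustration rather than the procedure used in the experiments: the function names are hypothetical, points are treated as complex numbers, and a translation is included alongside the scaling and rotation so that both anchors land exactly on chosen canonical positions.

```python
import cmath


def similarity_from_anchors(p1, p2, q1, q2):
    """Return (a, b) such that z -> a*z + b maps p1 to q1 and p2 to q2.

    Points are (x, y) tuples treated as complex numbers x + iy; p1 and p2
    must be distinct. |a| is the scale factor and the phase of a is the
    rotation angle.
    """
    zp1, zp2 = complex(*p1), complex(*p2)
    zq1, zq2 = complex(*q1), complex(*q2)
    a = (zq2 - zq1) / (zp2 - zp1)
    b = zq1 - a * zp1
    return a, b


def apply_similarity(a, b, point):
    """Apply z -> a*z + b to an (x, y) point."""
    z = a * complex(*point) + b
    return (z.real, z.imag)


def scale_and_angle(a):
    """Scale factor and rotation angle (radians) encoded by a."""
    return abs(a), cmath.phase(a)


# Hypothetical eye positions in an input image, mapped to canonical
# positions (0, 0) and (1, 0).
a, b = similarity_from_anchors((10, 20), (30, 20), (0, 0), (1, 0))
right_eye = apply_similarity(a, b, (30, 20))
```

Normalizing every example and every novel input with the same canonical anchor positions removes image-plane scale and rotation before the optical flow step, so the flow only has to account for the remaining differences.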

Once the correspondence step is performed, each of
the example images corresponds to a vector of dimensionality
$q = 2 \times p \times p$, where $p \times p$ is the number of
pixels in each image.

To build the network discussed in section 5.1, one
must first choose a basis function $G$ in Equation (1).
While different choices are possible, such as multiquadrics
and splines, we have chosen the Gaussian function.
In the next step, the “training” procedure, the network
parameters of Equation (1) are estimated. In our
Gaussian Radial Basis Function network, the Gaussian
$\sigma$'s are estimated by taking 75% of the mean distance
between all pairs of example inputs. The coefficients are
then calculated - this is the learning stage - using the
pseudoinverse solution of Equation (4).
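
The training procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the network of Equations (1) and (4), which are defined elsewhere in the paper: it assumes one Gaussian center per example, the form exp(-r^2 / (2 sigma^2)) for the Gaussian, and no additional polynomial term. With one center per example the matrix of basis-function values is square and nonsingular for distinct inputs, so the pseudoinverse solution reduces to an ordinary linear solve. All function names are hypothetical.

```python
import math


def pairwise_mean_distance(xs):
    """Mean Euclidean distance over all pairs of example inputs."""
    n = len(xs)
    d = [math.dist(xs[i], xs[j]) for i in range(n) for j in range(i + 1, n)]
    return sum(d) / len(d)


def gaussian(r, sigma):
    """Gaussian basis function (assumed normalization)."""
    return math.exp(-(r * r) / (2.0 * sigma * sigma))


def solve(A, B):
    """Solve A X = B for square nonsingular A (Gauss-Jordan, partial pivoting)."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]


def train_rbf(xs, ys, sigma_fraction=0.75):
    """Estimate sigma as 75% of the mean pairwise distance, then the coefficients."""
    sigma = sigma_fraction * pairwise_mean_distance(xs)
    G = [[gaussian(math.dist(xi, xj), sigma) for xj in xs] for xi in xs]
    return sigma, solve(G, ys)


def predict_rbf(x, xs, C, sigma):
    """Network output for a new input x: weighted sum of Gaussians."""
    g = [gaussian(math.dist(x, xi), sigma) for xi in xs]
    return [sum(g[i] * C[i][k] for i in range(len(xs))) for k in range(len(C[0]))]


# Toy stand-ins for four vectorized images and their pose/expression labels.
xs = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
ys = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sigma, C = train_rbf(xs, ys)
estimate = predict_rbf((0.5, 0.5, 0.25), xs, C, sigma)
```

In the setting of the paper the inputs would be vectors of dimensionality q rather than the three-component toy vectors used here; the procedure itself does not change with the input dimension.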

At run time, the correspondence step must be performed
again on the novel input image to establish the
correspondence with the example images. The novel input
is then equivalent to a vector of the same dimensionality
$q$, which is processed by the network, providing an
output which is an estimate of pose for the novel image.

Figure 2 shows four example images we used to train a
network to estimate face rotation and degree of smiling.
Image $img_1$ is used as reference by the correspondence
process, so $\mathbf{x}_1 = \mathbf{0}$ and the other examples $\mathbf{x}_i$, $i = 2, 3, 4$,
are represented by vectors that contain as components
the coordinates of each pixel in $img_1$ relative to the corresponding
pixel in $img_i$. In this case we assigned to our
four examples the corners of the unit square as output
values (in general we could have an arbitrary number of
examples at arbitrary locations). In figure 3 we show
some examples of rotation and expression estimates calculated
by equation (1). As we will discuss in section 7
and as shown in figures 6 through 8, some of the applications
of combining this analysis network with the synthesis
network include low bandwidth teleconferencing
and using a human actor to “direct” a cartoon character
or another human.

4  The  Synthesis  Network:  learning  to 
generate  a  novel  image  as  a  function 
of  input  parameters 

In this section we discuss an example-based technique
for generating images of non-rigid 3D objects as a function
of input parameters. This image synthesis task - a
computer graphics task - is the inverse of the problem
of image analysis - a computer vision problem - that we
have discussed in the previous section.

In our approach, the input space is hand-crafted by
the user to represent desired degrees of control over the
images of the 3D object. After assigning example images
of the object to positions in parameter space, novel (grey
level or color) images are synthesized by using interpolation/approximation
networks of the same type we have
described for the analysis problem. In this paper we will
demonstrate the technique on images of faces with input
parameters such as pose, facial expression, and person
identity, but other parameters may also be used.

To correctly interpolate the shape of two grey level
images, we first find pixelwise correspondence as
described above for the analysis case. Through the correspondence
stage each image is associated with a vector
of the x, y coordinates of corresponding pixels, as in the
Poggio-Brunelli technique. Once the example images are
vectorized, the learning stage can take place.

Following the Poggio-Brunelli and Librande-Poggio
techniques, the user then

1. associates each example image to an appropriate
position in the input space (for instance, pose and
expression coordinates);

2. uses a multidimensional Regularization Network to
synthesize the mapping between inputs and outputs;

3. displays, that is renders, the vectorized image as a
grey-level or color image.

The technique proposed here uses networks of the type
described later (typically with $n = N$, that is, Regularization
Networks) as the preferred class of multidimensional
approximation/interpolation technique. A specific
choice of the regularization Green function provides a
specific instance of regularization networks: examples
are Gaussian Radial Basis Functions, Multiquadric Basis
Functions and tensor products of piecewise linear splines,
all of which were used in the Brunelli-Poggio and the
Librande-Poggio implementations.

In summary, the synthesis network is trained on the
examples $(\mathbf{y}_i, \mathbf{x}_i)$, that is, the (vectorized) images and the
corresponding pose vectors. After the learning stage the
“trained” network can synthesize a new image $\mathbf{y}$ for each
pose input $\mathbf{x}$ given to the network.

4.1  A  first  example:  one-dimensional 
interpolation  between  two  images 

The simplest use of the technique is to interpolate
or “morph” between two face images. This is one-dimensional
interpolation in the sense that there is just
one parameter - which can be thought of as time or degree
of morph or (fuzzy) identity index - that controls the
relative contribution of just two example images: thus
the approximation network of figure 1 has in this case
only one input and is “trained” using just two examples
(and therefore has only two “hidden” units). In this
example we use the pixelwise correspondences to interpolate
the image of the face with linear splines - that is,
using a regularization network with $G(x) = |x|$, see later.
A simple cross dissolve is used to interpolate the
grey levels.
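
To make the link between a network with $G(x) = |x|$ and linear interpolation concrete, the two-example case can be worked out by hand. The sketch below assumes that the two examples sit at $x = 0$ and $x = 1$ with outputs $\mathbf{y}_1$ and $\mathbf{y}_2$, and that the network carries a constant term $\mathbf{d}$ together with the side condition $\mathbf{c}_1 + \mathbf{c}_2 = \mathbf{0}$, as is usual for this basis function; the formulation given later in the paper may differ in detail.

```latex
\begin{align*}
f(x) &= \mathbf{c}_1\,|x - 0| + \mathbf{c}_2\,|x - 1| + \mathbf{d},
  \qquad \mathbf{c}_1 + \mathbf{c}_2 = \mathbf{0},\\
f(0) &= \mathbf{c}_2 + \mathbf{d} = \mathbf{y}_1, \qquad
f(1) = \mathbf{c}_1 + \mathbf{d} = \mathbf{y}_2
  \;\Longrightarrow\;
  \mathbf{c}_1 = -\mathbf{c}_2 = \tfrac{1}{2}(\mathbf{y}_2 - \mathbf{y}_1),\quad
  \mathbf{d} = \tfrac{1}{2}(\mathbf{y}_1 + \mathbf{y}_2),\\
f(x) &= \mathbf{c}_1\,x + \mathbf{c}_2\,(1 - x) + \mathbf{d}
      = (1 - x)\,\mathbf{y}_1 + x\,\mathbf{y}_2
  \qquad (0 \le x \le 1).
\end{align*}
```

When the first image is the reference, so that $\mathbf{y}_1 = \mathbf{0}$, the interpolated geometry is simply $x\,\mathbf{y}_2$.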

Let $img_1$ and $img_2$ be the two face images to be interpolated
and let $x$ be the (input) interpolation parameter.
Three examples of 1D interpolation are shown in figure 4.
In the first and last the input parameter is similarity
to the first face: the figure shows a “morph” between
two different people. The last case is an amusing failure
case where the optical flow algorithm fails to find correct
correspondence. The second demonstration interpolates
a change in expression and slight head rotation,
given two examples of the same face. Thus, the simplest
special case of our interpolation technique is morphing
between two similar images. Even just for morphing,
our technique is considerably more automatic than
others. Notice that the use of optical flow algorithms to
automatically perform one-dimensional interpolation between
two images is also discussed in (Bergen-Hingorani,
1990) as a method for doing frame rate conversion.

Let $img_1$ and $img_2$ be the two face images to interpolate
and let $x$ be the interpolation parameter. We then
compute the interpolated image $img_{int}(x)$ by

$$\mathbf{y} = \mathrm{vect}(img_2, img_1)$$
$$img_{int}(x) = (1 - x)\,\mathrm{rend}(x\mathbf{y}, img_1, \mathbf{0}) + x\,\mathrm{rend}(x\mathbf{y}, img_2, \mathbf{y})$$

Note that $img_1$ is reproduced exactly when $x = 0$, and $img_2$
when $x = 1$.
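
The interpolation formula above can be illustrated on a one-dimensional toy signal. This is a sketch under simplifying assumptions, not the implementation behind figure 4: the "images" are short lists, the flow is a list of integer displacements, indices wrap around (a periodic boundary chosen so that a uniform shift creates no holes), and both rend terms are realized as a forward warp to the same rounded position. The name `morph_1d` is hypothetical.

```python
import math


def morph_1d(img1, img2, y, x):
    """Toy 1-D version of the interpolated image on a periodic signal.

    y[i] is the integer displacement that takes pixel i of img1 to its
    corresponding pixel i + y[i] of img2 (indices wrap around).
    """
    n = len(img1)
    out = [0.0] * n
    for i in range(n):
        # Geometry: move the pixel a fraction x of the way along its flow vector.
        pos = math.floor(i + x * y[i] + 0.5) % n
        # Texture: cross dissolve between the two corresponding grey levels.
        j = (i + y[i]) % n
        out[pos] = (1 - x) * img1[i] + x * img2[j]
    return out


# A single bright pixel that sits two positions further right in img2.
img1 = [0, 0, 9, 0, 0, 0]
img2 = [0, 0, 0, 0, 9, 0]
flow = [2] * 6
halfway = morph_1d(img1, img2, flow, 0.5)
```

At the halfway point the bright pixel sits midway between its two example positions at full intensity, whereas a plain cross dissolve without correspondence would show two half-intensity copies. For non-uniform flows, pixels can collide or leave gaps after rounding, which this sketch does not handle.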

4.2  Multidimensional  synthesis  of  face  images 

Our main result here is the first demonstration that
approximation/interpolation techniques can be used to
synthesize dense grey level images for a multidimensional
input space.

Figure 2 shows the four examples of a face and their
pose parameters in a 2-dimensional input space, the dimensions
of which are designed to represent the variability
present amongst the examples. The assignment
of each image to a location in the input space was decided
by the user. The first step is to compute correspondence
of images $img_2$, $img_3$, and $img_4$ with respect
to reference image $img_1$. We then use bilinear interpolation
among the four examples to synthesize novel images,
each corresponding to a value of the 2-dimensional
pose vector. The process is equivalent to using a tensor
product of piecewise linear splines in two dimensions
(which is a regularization network with Green function
$G(x, y) = |x||y|$, see Appendix A). In our demonstration,
the network generates new vector images for “any” desired
value of rotation and expression, the coordinates of
each pixel corresponding to the coordinates of a so-called
output “control” point in the Poggio-Brunelli-Librande
description. We have rendered the vector which is the
output of the network into a grey-value image by bilinearly
interpolating the grey level values between corresponding
pixels of the example images. In figure 5,
the example images, which are bordered in black, have
been placed at the corners of the unit square in $(x, y)$
rotation/expression space. All the other images are synthesized
by the network described above.

To synthesize an image at coordinates $(x, y)$, we first
vectorize the examples using $img_1$ as a reference

$$\mathbf{y}_i = \mathrm{vect}(img_i, img_1),$$

where $\mathbf{y}_1$ will be assigned the zero vector. Then we use
bilinear interpolation to compute the new face geometry

$$\mathbf{y}_{int}(x, y) = \sum_{i=1}^{4} b_i(x, y)\,\mathbf{y}_i,$$

where $b_i(x, y)$ is the bilinear coefficient of example $i$ (see
Appendix A)

$$b_1 = (1 - x)(1 - y) \qquad b_2 = x(1 - y)$$
$$b_3 = (1 - x)y \qquad b_4 = xy.$$

Finally, we render the interpolated image $img_{int}$ using
contributions from the facial texture of all examples

$$img_{int}(x, y) = \sum_{i=1}^{4} b_i(x, y)\,\mathrm{rend}(\mathbf{y}_{int}, img_i, \mathbf{y}_i).$$
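
The geometry part of this computation, the bilinear blend of the four vectorized examples, can be sketched in a few lines. The function names and the flat-list representation of each vectorized image are assumptions made for the illustration; the rendering step, which needs the rend operator, is left out.

```python
def bilinear_weights(x, y):
    """Bilinear coefficients b_1..b_4 for a point (x, y) in the unit square."""
    return [(1 - x) * (1 - y), x * (1 - y), (1 - x) * y, x * y]


def interpolate_geometry(x, y, flows):
    """Blend four vectorized examples y_1..y_4 (equal-length flat lists).

    flows[0] is the reference example and is therefore all zeros.
    """
    b = bilinear_weights(x, y)
    return [sum(bi * f[k] for bi, f in zip(b, flows)) for k in range(len(flows[0]))]


# Toy flow vectors with two components each (a real one has 2 * p * p).
flows = [[0.0, 0.0], [4.0, 0.0], [0.0, 2.0], [4.0, 2.0]]
center = interpolate_geometry(0.5, 0.5, flows)
```

The weights are non-negative and sum to one everywhere in the unit square, and each corner of the square reproduces the corresponding example exactly.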


Thus, given a set of example views, we can synthesize
any intermediate view by using interpolation. In the examples
of this section we have used a two dimensional
input space. As is clear from the theory (see examples
in Librande, 1992), the input space may have any dimension,
provided there is a sufficient number of examples.
We have successfully generated face images in a three
dimensional input space of pose-expression parameters.

Together, the analysis and synthesis networks can be
used for low-bandwidth teleconferencing, effectively using
the analysis parameters - that is, the output of the
analysis network - as a model-based image code. Before
the teleconferencing session, the transmitter sends the
receiver the information it needs to initialize the synthesis
network - the example images and their assigned
locations in pose-expression space. During the teleconferencing
session, the transmitter feeds each image frame
through the analysis network and transmits only the analysis
parameters; the receiver then uses the synthesis
network to reconstruct the image. Figure 6 demonstrates
the process for the set of example images in figure 2 and
our 2D rotation-smiling parameter space $(x, y)$.
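
The per-frame payload in such a scheme can be made concrete with a small sketch. Everything specific here is an assumption for illustration: the paper does not specify a wire format, so the parameters are simply packed as little-endian 32-bit floats, and the 128 x 128 8-bit frame used for comparison is a hypothetical size.

```python
import struct


def encode_frame(params):
    """Pack the analysis parameters of one frame as little-endian 32-bit floats."""
    return struct.pack("<%df" % len(params), *params)


def decode_frame(payload):
    """Recover the parameter list from a payload produced by encode_frame."""
    n = len(payload) // 4
    return list(struct.unpack("<%df" % n, payload))


# One frame of the 2D rotation-smiling space: two numbers, eight bytes.
payload = encode_frame([0.25, 0.75])
params = decode_frame(payload)

# Hypothetical uncompressed frame for comparison: 128 x 128 pixels, 8 bits each.
raw_frame_bytes = 128 * 128
ratio = raw_frame_bytes // len(payload)
```

Under these assumed sizes each frame costs 8 bytes instead of 16384, with the example images sent once before the session begins.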

While figure 6 uses the same set of images at both
the transmitter and receiver, in general we can train the
synthesis network with a different set of example images.
For example, a stored set of examples taken while
“dressed up” can be used to always put a person's “best
face forward” at the receiving end, even if the person
is informally dressed at the transmitter. To make this
work, the only requirement is that the two example sets
are parameterized by the same control space. In another
example, by putting a set of example images of another
person at the receiver, the analysis/synthesis network
can be used to “direct” this new person. For instance,
in figure 8, the receiver is trained with the example set
of figure 7, enabling the person at the transmitter to
direct the person in this new example set. If we use a
set of cartoon character images at the receiver, our approach
for teleconferencing also becomes a method for
performance-driven animation.

5  Theoretical  background 

The techniques described in this paper are, to a large
extent,  the  result  of  an  integration  of  two  existing  sets 
of  methods: 

1.  How  to  establish  automatically  pixel-wise  corre¬ 
spondence  between  grey  level  images. 

2.  How  to  approximate  maps  from  a  vector  space  to 
a  vector  field. 

The  correspondence  problem  is  typically  seen  in  the 
field  of  motion  measurement,  stereopsis  and  structure 
from  motion.  Recent  advances  in  visual  recognition  in 
terms of interpolation networks (Poggio & Girosi 1990;
Edelman & Poggio 1990; Brunelli and Poggio 1991),
alignment techniques (cf. Ullman 1986; Huttenlocher &
Ullman 1990; Ullman & Basri 1991), and methods using
geometric invariants across multiple views (cf. Mundy
et al. 1992; Weinshall 1993; Shashua 1993b), have also
put new emphasis on achieving correspondence between
model images stored in memory, prior to the actual
recognition of novel instances of objects. We propose
to  use  correspondence  not  only  in  image  analysis  but 
also  in  computer  graphics,  by  providing  the  ability  to 
fully  register  example  images  of  an  object  for  later  syn¬ 
thesis  of  new  examples.  The  correspondence  techniques 
we  used  will  be  described  in  Section  5.2. 

The  second  set  of  methods  uses  non-linear  interpola¬ 
tion  (or  approximation)  from  a  vector  space  to  a  vector 
field  using  generalized  regularization  networks  (Poggio 
& Girosi, 1990). The basic idea, proposed by Poggio
and  Edelman  (1990),  is  to  view  the  problem  of  estimat¬ 
ing  pose  from  an  object  image  as  a  non-linear  mapping 
between  2D  images  (examples)  and  the  corresponding 
control  parameters,  such  as  pose  and  facial  expressions. 
Poggio & Brunelli (1992) suggested that the synthesis
problem could be viewed in terms of the associated
inverse mapping.

The  connection  between  this  and  correspondence  is 
that  the  2D  example  images  must  be  fully  registered  with 
each  other  prior  to  the  interpolation  process.  We  will 
discuss  regularization  networks  in  Section  5.1. 

In  Section  5.3  we  point  out  that  regularization  net¬ 
works  can  be  used  to  map  directly  vector  fields  onto 
vector  fields.  In  other  words,  instead  of  mapping  2D  im¬ 
ages  onto  control  parameters  and  then  into  images  by  an 
analysis-synthesis  sequence  of  steps,  we  map  2D  images 
directly  onto  other  2D  images.  We  show  that  this  may  be 
useful  in  teleconferencing  applications  and  propose  con¬ 
nections  between  that  mapping  and  recent  techniques  in 
visual recognition applied to non-rigid objects (Ullman
& Basri 1991; Poggio, 1990) and certain classes of objects
called "Linear Classes" (Poggio & Vetter, 1992).

5.1  Approximation  of  vector  fields  through 
regularization  networks 

Consider  the  problem  of  approximating  a  vector  field 
y(x)  from  a  set  of  sparse  data  -  the  examples,  which 
are pairs (y_i, x_i), i = 1, ..., N. Choose a Regularization
Network  or  a  Generalized  Regularization  Network  (see 
Girosi,  Jones  and  Poggio,  1993)  as  the  approximation 
scheme,  that  is  a  network  with  one  “hidden”  layer  and 
linear  output  units  (see  for  instance  Poggio  and  Girosi, 
1989).  Consider  the  case  of  N  examples,  n  <  N  cen¬ 
ters,  input  dimensionality  d  and  output  dimensionality 
q.  Then  the  approximation  is 

y(x) = \sum_{i=1}^{n} c_i G(x - x_i)    (1)

with  G  being  the  chosen  Green  function,  which  may  be 
a  radial  basis  function,  like  the  Gaussian,  or  a  spline, 
like  some  tensor  product  spline.  The  equation,  which  is 
equivalent  to  the  network  of  figure  1,  can  be  rewritten 
in  matrix  notation  as 

y(x)  =  Cg(x)  (2) 

where g is the vector with elements g_i = G(x - x_i).

Let us define as G the matrix of the chosen Green function
evaluated at the examples, that is the matrix with
elements G_{i,j} = G(x_i - x_j). Then the "weights" c are
"learned" from the examples by solving


Y  =  CG.  (3) 

where Y is defined as the matrix in which column l is
the example y_l. C is defined as the matrix in which row
m is the vector c_m. This means that x is a d x 1 matrix,
C is a q x n matrix, Y is a q x N matrix and G is an
n x N matrix. Then the set of weights C is given by

C = Y G^{+},    (4)

where G^{+} denotes the pseudoinverse of G.
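
Equations (1)-(4) can be sketched numerically as follows. This is a minimal illustration under stated assumptions: a Gaussian Green function, one center per example (n = N), and made-up data. The function names `green`, `fit` and `predict` are hypothetical.

```python
import numpy as np

def green(x, centers, sigma):
    """g(x): vector with elements G(x - x_i) for a Gaussian Green function."""
    return np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * sigma ** 2))

def fit(X, Y, centers, sigma):
    """Solve Y = C G in the least-squares sense, C = Y G^+ (equation 4).

    X: (d, N) example inputs as columns, Y: (q, N) example outputs as columns,
    centers: (n, d). Returns C with shape (q, n).
    """
    G = np.stack([green(x, centers, sigma) for x in X.T], axis=1)  # (n, N)
    return Y @ np.linalg.pinv(G)

def predict(C, x, centers, sigma):
    """Evaluate y(x) = C g(x) (equation 2)."""
    return C @ green(x, centers, sigma)

d, q, N = 2, 3, 6
# Six example inputs on a small grid (columns of X); outputs are random stand-ins.
X = np.array([[0.0, 0.5, 1.0, 0.0, 0.5, 1.0],
              [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
Y = np.random.default_rng(1).normal(size=(q, N))
centers = X.T.copy()            # n = N: one center per example
sigma = 0.5
C = fit(X, Y, centers, sigma)
```

With n = N and distinct centers the network interpolates the examples; with n < N the same `fit` call returns the least-squares weights instead.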

5.2  The  Correspondence  Problem  for 
Grey-level Images

While  the  general  approximation  scheme  outlined  above 
can  be  used  to  learn  a  variety  of  different  mappings, 
in  this  paper  we  discuss  mappings  between  images  of 
objects  and  their  parameters,  such  as  pose  or  expres¬ 
sion,  and  between  images  onto  themselves  (notice  that 
all  the  techniques  of  this  paper  generalize  directly  from 
2D  views  to  3D  views  -  that  is  3D  models  -  of  an  object). 
In  image  analysis,  the  approximation  techniques  can  be 
used  to  map  input  images  x  into  output  pose  and  ex¬ 
pression  vectors  y.  They  can  also  be  used  to  synthesize 
the  map  from  input  pose  vectors  x  to  output  images  y 
for  image  synthesis.  The  correspondence  problem  is  the 
same  in  both  cases.  Our  approach  critically  relies  on  the 
ability  to  obtain  correspondence  between  two  example 
images. 

The  choice  of  features  (components  of  the  vectors  y )  is 
crucial  both  in  obtaining  a  smooth  mapping  for  the  reg¬ 
ularization  network,  and  for  the  type  of  correspondence 
method  that  is  to  be  used.  One  of  the  most  natural 
choices  -  which  satisfies  the  requirements  of  smoothness 
of  the  input-output  mapping  -  is  a  vectorization  of  the 
image in terms of the (x, y) locations of image features.
For instance, the image of a face with n features, such
as the "inner corner of the left eye", will be represented
by  a  vector  of  length  2n,  which  is  simply  the  concatena¬ 
tion  of  all  the  features’  x  coordinates  followed  by  the  y 
coordinates.  This  is  a  representation  of  shape;  color  or 
grey  level  information  can  be  tagged  to  the  feature  points 
as  auxiliary  information  (and  additional  components  of 
the  y  vector).  It  is  important  to  stress  that  many  other 
representations  are  possible  and  have  been  used.  In  any 
case  a  sparse  x,  y  representation,  originally  suggested  for 
object  recognition  (see  for  instance  Ullman  and  Basri, 
1989)  was  also  used  by  Poggio  and  Brunelli  (1992)  for 
synthesis  of  real  images  in  which  features  (in  the  order  of 
20  or  so)  were  located  and  brought  into  correspondence 
manually.  The  same  representation  has  been  recently  ap¬ 
plied  to  generating  images  of  hand-drawn  colored  illus¬ 
trations  (Librande,  1992).  The  feature  points  are  much 
denser,  on  the  order  of  hundreds,  and  are  placed  along 
the  contours  of  objects.  The  correspondence  problem  is 
alleviated  somewhat  by  using  algorithms  for  automatic 
contour  tracing  and  matching  -  only  correspondences  of 
a  few  points  per  contour  need  to  be  specified.  Texture 
mapping  was  used  by  Poggio  and  Brunelli  as  well  as  by 
Librande  to  render  the  image  from  the  feature  points. 
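
The sparse x, y representation described above can be sketched as follows; the feature locations are made-up values and `vectorize_features` is a hypothetical helper name.

```python
import numpy as np

def vectorize_features(points):
    """Concatenate all x coordinates followed by all y coordinates.

    points: sequence of (x, y) feature locations, in a fixed feature order.
    Returns a vector of length 2n.
    """
    pts = np.asarray(points, dtype=float)
    return np.concatenate([pts[:, 0], pts[:, 1]])

# Hypothetical feature locations in pixels: two eye corners and a mouth corner.
face = [(31.0, 40.0), (58.0, 41.0), (44.0, 72.0)]
y = vectorize_features(face)
```

The feature order must be the same for every example image, since that ordering is what encodes correspondence in this representation.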

In  this  paper  we  suggest  using  the  densest  possible 
representation - one feature per pixel, originally suggested
for visual recognition (Shashua, 1991). This
avoids the need of using texture mapping techniques at
the  expense  of  solving  a  difficult  correspondence  prob¬ 
lem.  Morphing  in  computer  graphics  (Beier  and  Neely, 
1992)  faces  a  similar  correspondence  problem  that  is  typ¬ 
ically  solved  with  time-consuming  manual  intervention. 

We  have  discovered  that  standard  optical  flow  algo¬ 
rithms,  preceded  by  certain  normalization  steps,  can  do 
a  good  job  at  automatically  computing  dense  pixel-wise 
correspondences  with  the  images  we  have  used  so  far. 
This  ability  to  automatically  compute  correspondences 
in  grey  level  images  makes  practical  the  “vectorization" 
of  a  grey  level  image,  that  is  the  computation  of  a  vector 
y  associated  to  it  and  which  describes  the  position  of 
each  pixel  relative  to  a  chosen  reference  image.  There 
are  many  similar  optical  flow  algorithms  that  perform  in 
similar  ways.  We  describe  one  which  we  have  used  in  the 
experiments  described  here.  We  also  refer  in  the  rest  of 
the  paper  to  face  images,  though  the  technique  should 
work  for  a  variety  of  3D  objects. 

The coarse-to-fine gradient-based optical flow algorithm
used in the examples of this paper follows (Lucas
& Kanade 1981; Bergen & Adelson 1987; Bergen
& Hingorani 1990) and is applied after a normalization
stage for compensating for certain image plane transformations.
In our face images we manually located the
eye centers at the center of the iris (Brunelli and Poggio
1992, Yuille et al. 1989, Stringa 1991, and Beymer 1993
describe automatic techniques for finding the eyes) and
used them to bring the two images into alignment. This
corrects for differences in scale, translation, and image-plane
rotation between the two face images. An important
byproduct  of  this  normalization  step  is  that  the  two  im¬ 
ages  are  brought  closer  together,  making  the  remaining 
correspondence  problem  slightly  easier.  Once  the  im¬ 
ages  are  in  rough  geometric  alignment,  the  remaining 
displacement between the two frames is found using
a standard gradient-based optical flow method (we have
also  used  other  algorithms  that  do  not  use  gradient  com¬ 
putations  with  similar  or  better  results). 

If  we  assume  the  remaining  displacement  between  the 
two  images  is  sufficiently  small,  and  that  the  brightness 
value  of  corresponding  points  does  not  change  much, 
then we have the following equation, known as the "constant
brightness equation" (Horn & Schunck, 1981):

\nabla I \cdot v + I_t = 0,    (5)

where \nabla I is the gradient at point p, I_t is the temporal
derivative at p, and v is the displacement vector
viewed as a velocity vector. This equation describes
a linear approximation to the change of image grey-values
at p due to image motion and is obtained by a
first order approximation of a Taylor series expansion
of I(x + dx, y + dy, t + dt) = I(x, y, t). Since the solution
of this problem is underdetermined at a point -
one  equation  for  two  unknowns  -  the  additional  con¬ 
straint  of  flow  smoothness  is  made  and  the  solution  is 
computed  over  small  neighborhoods.  A  coarse-to-fine 
strategy,  currently  implemented  by  Laplacian  Pyramids 
(Burt & Adelson, 1983), estimates displacements at the
coarser  levels  of  the  pyramid  and  refines  the  estimates 
as finer levels are processed. This enables fast processing
while still being able to find large displacements, for
a large displacement at a high resolution is a small displacement
at lower resolutions. More details of this particular
implementation can be found in Bergen & Hingorani
(1990).
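
To make equation (5) concrete, the following is a minimal single-scale sketch that solves the constant brightness equation by least squares over one window. It is not the coarse-to-fine implementation referenced above: it assumes a single constant displacement for the whole window, a synthetic test pattern, and gradients taken on the average of the two frames. The names `estimate_displacement` and `pattern` are hypothetical.

```python
import numpy as np

def estimate_displacement(img0, img1):
    """Least-squares solution of grad(I) . v + I_t = 0 over one window."""
    mean = 0.5 * (img0 + img1)
    Iy, Ix = np.gradient(mean)              # derivatives along rows (y) and columns (x)
    It = img1 - img0                        # temporal derivative
    inner = (slice(1, -1), slice(1, -1))    # drop the one-sided border differences
    A = np.stack([Ix[inner].ravel(), Iy[inner].ravel()], axis=1)
    b = -It[inner].ravel()
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v                                # (dx, dy)

def pattern(x, y):
    """Smooth synthetic grey-level pattern (made up for this example)."""
    return np.sin(0.2 * x) * np.cos(0.15 * y) + 0.5 * np.sin(0.1 * x + 0.25 * y)

ys, xs = np.mgrid[0:64, 0:64].astype(float)
true_v = np.array([0.3, -0.2])              # sub-pixel displacement (dx, dy)
img0 = pattern(xs, ys)
img1 = pattern(xs - true_v[0], ys - true_v[1])   # img0 displaced by true_v
v = estimate_displacement(img0, img1)
```

Each pixel contributes one equation in two unknowns, so pooling the pixels of a neighborhood is what makes the problem solvable. A pyramid would apply the same solve at each level, starting from the coarsest.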

5.3  Mapping  between  Vector  Fields:  The  Case 
of  Linear  Combination 

We  have  described  techniques  for  mapping  fully  regis¬ 
tered  2D  images  onto  corresponding  control  parameters, 
such  as  pose  and  expressions  of  a  face.  The  mapping  can 
be  later  on  used  in  the  synthesis  module  to  generate  new 
images,  say  of  a  face,  based  on  input  of  novel  parameter 
settings,  such  as  novel  poses  and  novel  facial  expressions. 
Here  we  suggest  an  alternative  technique,  with  potential 
applications to teleconferencing (see later in this section,
Section  7  and  Appendices  B  and  C),  in  which  2D  images 
are  mapped  directly  onto  2D  images. 

Continuing the derivations of Section 5.1, we note (see
also Girosi, Jones and Poggio, 1993) that the vector field
y is approximated by the network as the linear combination
of the example fields y_l, that is

y(x) = Y G^{+} g(x)    (6)

which can be rewritten as

y(x) = \sum_{l=1}^{N} b_l(x) y_l    (7)

where the b_l depend on the chosen G, according to

b(x) = G^{+} g(x).    (8)

This derivation readily suggests that instead of estimating
x corresponding to the "novel" image y_d, we can
estimate the "best" coefficients b_l such that

y_d = \sum_{l=1}^{N} b_l y_l,    (9)

and then use the estimated b_l - that is, b = Y^{+} y_d - for
reconstructing y_d. Therefore, in cases where we are not
interested  in  the  vector  x  of  control  parameters,  we  can 
reconstruct  the  novel  image  by  simply  combining  the 
available  examples. 
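
A minimal numerical sketch of this idea follows, assuming made-up vectorized examples stored as the columns of Y and a novel vector that lies exactly in their span. In practice the fit is only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 3))        # columns: three vectorized example images (made-up data)
b_true = np.array([0.5, -0.2, 0.7])
y_d = Y @ b_true                     # a "novel" vector in the span of the examples

b = np.linalg.pinv(Y) @ y_d          # b = Y^+ y_d, the least-squares coefficients
y_recon = Y @ b                      # reconstruction from the coefficients alone
```

Here three coefficients stand in for a 200-component vector. When the novel vector is not in the span of the examples, `y_recon` is its least-squares projection onto that span.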

The  natural  application  is  teleconferencing,  where  the 
novel  image  is  given  at  the  sender  site  and  the  receiver 
only requires the coefficients b_l for combining the example
images. In figure 10, we use the example set of figure
9 and equation (9) to write each novel image on the left
directly in terms of the three vectorized example images
y_2, y_3, and y_4. (Since img_1 is the reference image,
y_1 = 0, and it makes no contribution to equation (9).)
The reconstructed image, shown on the right, is synthesized
by recalculating the vectorized image y_{recon} (using
equation (9)) and then rendering it using the image texture
from img_1 (img_{recon} \leftarrow rend(y_{recon}, img_1, 0)). In
the figure we show two sets of b_l coefficients since the x
and y components of y_d are represented separately using
equation (9).

This  approach  to  teleconferencing  can  also  be  used  to 
“direct”  another  person  when  the  receiver  has  another 
example set, as we demonstrate in figure 11. Compared
to using the analysis/synthesis approach, however, there
are tighter constraints coupling the transmitter and receiver
examples. Since the coefficients are tied to specific
examples  instead  of  an  abstract  pose-expression  param¬ 
eter,  the  examples  in  both  the  transmitter  and  receiver 
sets  must  be  at  the  same  locations  in  pose-expression 
space.  In  this  sense  the  analysis/synthesis  network  is 
more  flexible,  as  the  abstraction  provided  by  the  input 
space  parameterization  allows  transmitter  and  receiver 
examples  to  fall  at  different  locations.  Section  7  and 
Appendices  B  and  C  present  more  details  on  our  two 
techniques  for  teleconferencing. 

Equation  7  provides  the  connection  between  the  two 
techniques,  and  we  have  the  following  result: 

For any choice of the regularization network (even a
Generalized Regularization Network, see Girosi, Jones
and Poggio, 1993) and any choice of the Green function
- including Green functions corresponding to additive
splines and tensor product splines - the estimated
output (vector) image is always a linear combination of
example (vector) images with coefficients b that depend
(nonlinearly) on the input value.

The  result  is  valid  for  all  networks  of  the  type  shown  in 
figure 1, provided that an L2 criterion is used for training.

This technique is intimately connected with two concepts
in visual recognition: one is the linear combination
of views of Ullman & Basri (1991), and the other is
the linear class of objects proposed by Poggio & Vetter
(1992). These connections, which we discuss below, are
a  particular  case  of  the  general  result  described  above, 
but  on  the  other  hand  are  sharp  (under  certain  assump¬ 
tions),  rather  than  being  an  approximation. 

The  following  arguments  suggest  that  approximating 
a  2D  image  as  the  linear  combination  of  a  small  set  of 
“example”  images  is  exact  for  a  special  class  of  non- 
rigid  objects  provided  that  the  views  are  represented  as 
vectors  of  x,  y  coordinates  of  visible  “features”  and  image 
projection  is  orthographic. 

As a corollary of Ullman & Basri's linear combination
result  one  can  easily  show  the  following  two  related,  but 
different,  results.  First,  consider  objects  composed  of  k 
“parts”,  and  consider  the  set  of  images  of  such  an  ob¬ 
ject  —  which  is  defined  as  a  result  of  having  each  part 
undergo  an  arbitrary  3D  affine  transformation,  followed 
by a projection onto the image plane. The space of all
orthographic images of such an object forms a 3k-dimensional
space and can therefore be spanned by linearly
combining 3k images of the object (x and y components
separately). Note that segmentation, or decomposition
of the image into parts, is not required - only that we
have correspondence between the 3k example images.
This observation was made by Basri (1990) in the context
of articulated objects (objects composed of links,
like scissors), but may also be valid for more natural objects.
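
The single-part case (k = 1) of this result can be checked numerically. The sketch below assumes random 3D feature points, random linear 3D maps with translation omitted, and orthographic projection. It considers only the x components; the y components behave the same way. The helper `view_x` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(size=(3, 40))        # 3D coordinates of 40 feature points (one part)

def view_x(A, P):
    """x coordinates of the orthographic view after the linear 3D map A."""
    return (A @ P)[0]

# Three example views and one novel view, each under a random linear 3D map.
examples = np.stack(
    [view_x(rng.normal(size=(3, 3)), P) for _ in range(3)], axis=1
)                                   # shape (40, 3): one column per example view
novel = view_x(rng.normal(size=(3, 3)), P)

coeffs, *_ = np.linalg.lstsq(examples, novel, rcond=None)
residual = np.linalg.norm(examples @ coeffs - novel)
```

The residual is zero up to rounding because every x-coordinate vector is a combination of the three rows of P, and three generic views span that same 3-dimensional space.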

For  the  second  result,  consider  a  collection  of  k  dif¬ 
ferent  objects  (presumably  coming  from  the  same  class 
of  objects,  like  the  class  of  faces,  but  this  is  not  theoret¬ 
ically  necessary),  and  consider  the  class  of  objects  that 


can  be  obtained  by  linearly  combining  the  corresponding 
coordinates (in 3D) across the k objects (Vetter and Poggio,
1992, refer to this as the "Linear Class of Objects").
For example, the linear class of a square and a triangle
includes a trapezoid as a particular case (note that
a trapezoid cannot be obtained by means of an orthographic
view of a square). Next consider 3k "example"
images of these objects - three distinct orthographic
views per object. It is a simple matter to show that the
set of all views of the linear class (views of the k objects,
and all views of all objects generated by linear combinations
of those objects) is spanned by the set of 3k
examples.

These  results  express  two  different  aspects  of  the  same 
method  of  approximating  an  image  by  combining  a  small 
set  of  example  images.  In  one  aspect,  we  view  these  im¬ 
ages  as  views  of  a  non-rigid  object  that  is  assumed  to 
be  composed  of  a  finite  (small)  number  of  rigid  compo¬ 
nents.  Alternatively,  we  may  view  the  images  as  views 
of  different  objects  (presumably  related  to  each  other, 
such  as  faces).  For  example,  we  may  view  a  face  as  a 
non-rigid  object,  view  facial  expressions  as  a  linear  class 
of  example  images  of  the  same  face,  or  view  the  class 
of  faces  as  a  linear  class.  In  all  cases  we  use  the  same 
method  of  approximation  (which  is  exact  when  the  con¬ 
ditions  above  hold).  Finally  we  note  that  the  problem 
of  correspondence  between  the  example  images  receives 
different  interpretations  depending  on  how  we  view  the 
transformation space of objects. For example, if
we adopt the view of having a non-rigid object, then the
problem of correspondence is in theory well defined (just
as in the rigid case); if we assume a linear class of
objects, then correspondence is theoretically ill-defined
because we no longer have the projections of a fixed set
of points in space. In practice, however, we observed
that obtaining correspondence, using the methods described
in Section 5.2, between images of the same face
undergoing facial expressions is relatively easy. There
are many cases, however, in which the correspondence algorithm
fails, sometimes with funny effects (see figure 4).
Obtaining  correspondence  between  images  of  different 
faces  is  also  possible  but  in  several  cases  may  require 
more  interactive  methods  than  the  one  used  in  our  im¬ 
plementation. 

On  a  different  level,  notice  that  the  argument  based 
on  approximation  does  not  say  anything  about  the  num¬ 
ber  of  examples  needed  for  a  satisfactory  performance, 
whereas  the  other  argument  provides  such  an  estimate, 
at  least  in  principle  (an  analysis  of  the  effect  of  noise  is 
still  lacking).  The  approximation  argument  on  the  other 
hand  does  not  depend  on  the  particular  view  represen¬ 
tation,  as  long  as  the  mapping  between  the  pose  input 
and  the  view  output  is  smooth  (see  Poggio  and  Brunelli, 
1992  for  the  argument  and  an  example).  Thus  not  only 
x,y  coordinates  can  be  used  but  other  attributes  of  the 
feature  points  such  as  color  and  slant  as  well  as  angles  or 
distance  between  feature  points  and  even  non  geometric 
attributes  such  as  the  overall  luminance  of  the  image. 
Orthographic  as  well  as  perspective  projection  can  be 
dealt  with.  The  linear  combination  results,  on  the  other 
hand,  depend  on  orthographic  projection  and  especially 


critically  on  the  specific  view  representation,  based  on 
x, y coordinates of visible feature points.

6  Networks  that  generate  novel 
grey-level  images  from  a  single 
example  by  learning  class-specific 
transformations 

Given  a  set  of  example  images  of  an  object,  such  as  a 
face,  we  have  discussed  how  to  synthesize  new  images  of 
that  face  for  different  poses  and  expressions,  given  a  suf¬ 
ficient  number  of  examples  of  that  specific  face.  Suppose 
now  that  only  one  image  of  a  specific  face  is  available. 
How  can  we  synthesize  novel  views  of  that  particular 
face from just one "model" view? A natural idea is to
use a collection of example views of another person p
(or perhaps the average of a small set of people) as a
prototype for representing generic face transformations.
If we bring the image of a face img_{new} into correspondence
with the closest prototype image, then we may be
able  to  synthesize  images  of  the  new  person  as  we  would 
with  the  prototype  from  just  one  view  of  the  new  person. 
In  general,  we  want  to  generate  from  one  2D  view  of  a 
3D  object  other  views,  exploiting  knowledge  of  views  of 
other  objects  of  the  same  class. 

This  idea  of  generating  “virtual”  views  of  an  object  by 
using  class-specific  knowledge  has  been  discussed  before 
in (Poggio 1991, see also Poggio and Vetter, 1992). In
our  preliminary  work  we  have  applied  the  method  used 
by  Poggio  and  Brunelli  (1992)  to  generate  a  walking  an¬ 
imation  sequence  of  Carla  from  just  one  image  of  Carla 
and  a  sequence  of  frames  of  the  walking  "prototype” 
Roberto.  A  view  y  of  Carla  or  Roberto  is  a  vector  of 
x,  y  coordinates  of  manually  chosen  corresponding  points 
near the joints and limb extremities. Suppose that we
have two views of the prototype, y_{ref} and y_p, the second
of which is a transformed version of the first, for instance
a rotated view of y_{ref}. This prototype transformation
can be simply represented by the vector difference

\Delta y_p = y_p - y_{ref}.    (10)

Now let y_{nov} be a view of object nov that appears in the
same pose as y_{ref}. Poggio and Brunelli then generate
the transformed view of object nov, y_{p+nov}, by

y_{p+nov} = y_{nov} + \Delta y_p.    (11)

A  new  grey  level  image  is  then  generated  by  texture  map¬ 
ping  from  the  original  image  of  nov. 
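
Equations (10) and (11) amount to two vector operations. The sketch below uses made-up control-point vectors (three points, x coordinates followed by y coordinates) and leaves out the texture-mapping step.

```python
import numpy as np

# Made-up x, y control-point vectors: three points, x coordinates then y coordinates.
y_ref = np.array([10.0, 20.0, 30.0, 5.0, 5.0, 5.0])   # prototype, reference pose
y_p   = np.array([12.0, 21.0, 29.0, 6.0, 5.0, 4.0])   # prototype, transformed pose
y_nov = np.array([11.0, 22.0, 33.0, 5.5, 5.0, 4.5])   # novel object, reference pose

delta_y_p = y_p - y_ref          # equation (10): the prototype transformation
y_p_nov = y_nov + delta_y_p      # equation (11): transformation applied to the novel object
```

The addition is meaningful only because all three vectors list the same points in the same order and y_nov is in the same pose as y_ref.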

We  have  generated  “virtual”  views  of  a  face  by  using 
the  same  technique  applied  to  individual  pixels  rather 
than  sparse  control  points.  In  figure  12,  we  have  used 
the  optical  flow  between  a  pair  of  prototype  examples, 
img_{ref} and img_p, to "learn" face rotation and change of
expression. Once a new face image, img_{nov}, is brought
into correspondence with img_{ref}, we can apply the prototype
transformation to the new image, synthesizing
img_{p+nov}.

To generate a virtual view using our notation, prototype
transformations, as represented by a vectorized prototype
example y_p, need to be mapped onto a novel face.
If the novel face is vectorized using the prototype reference
image, producing y_{nov}, this can be done by simply
summing the two vectorized representations. Consider
figure 12, where the prototype rotation transformation
is represented by the vectorized img_p using img_{ref} as
reference

y_p \leftarrow vect(img_p, img_{ref}).

Vectorizing the novel face image img_{nov} using img_{ref}
as reference produces an "identity changing" transform,
specifying how to deform the face geometry of the prototype
into the novel face

y_{nov} \leftarrow vect(img_{nov}, img_{ref}).

The composition of the rotation transformation y_p and
the identity-changing transformation y_{nov} is simply the
sum

y_{p+nov} \leftarrow y_p + y_{nov}.

We can interpret the last equation in terms of altering
the geometry of the reference image by moving pixels
around. The composite transform moves a pixel first by
the prototype transform and then by the identity-changing
transform. Finally, we can render the composite geometry
using the facial texture from the novel image

img_{p+nov} \leftarrow rend(y_{p+nov}, img_{nov}, y_{nov}).

This  produces  the  image  shown  in  the  lower  right  of  fig¬ 
ure  12. 

An  interesting  idea  to  be  explored  in  the  near  future  is 
the  use  of  the  alternative  technique  proposed  by  Poggio 
and  Vetter  (1992)  for  generating  virtual  views  based  on 
the  concept  of  linear  classes,  mentioned  earlier. 

In  any  case,  a  sufficient  number  of  prototype  transfor¬ 
mations  -  which  may  involve  shape,  color,  texture  and 
other  image  attributes  by  using  the  appropriate  features 
in  the  vectorized  representation  of  images  -  should  in 
many  cases  allow  the  generation  of  more  than  one  vir¬ 
tual  view  from  a  single  “real”  view.  In  this  way  a  few 
prototypical  views  may  be  used  to  generate  correspond¬ 
ing  virtual  views  starting  from  just  a  single  image.  The 
resulting  set  of  virtual  examples  can  then  be  used  to  train 
a network for either synthesis or analysis (an alternative
way to achieve the same result is to use the virtual views
generated by a prototypical network such as the one of
figure 5 to transform the novel face appropriately). The
applications  of  this  technique  are  many.  One  of  them  is 
object  recognition  from  a  single  model  image.  Prelimi¬ 
nary  work  by  one  of  us  (Beymer,  1993,  thesis  proposal) 
has  shown  the  feasibility  of  the  basic  principle  in  the  case 
of  face  recognition. 

7  Discussion 

Two  key  ingredients  played  a  major  role  in  this  paper. 
First, we proposed an approach that can be characterized
as memory-based or learning-from-examples for
both  the  analysis  problem  and  for  the  synthesis  problem 
of  computer-graphics  -  instead  of  the  classical  physics- 
based  and  3D  model-based  approach.  Second,  we  have 
introduced  optical-flow  as  a  tool  for  our  methods,  to 
achieve  full-correspondence  between  example  images.  In 


addition,  we  have  established  a  theoretical  and  prac¬ 
tical  connection  between  multi-dimensional  approxima¬ 
tion  techniques  and  linear  mappings  between  vector 
fields  (linear  combination  of  views,  linear  class  of  ob¬ 
jects).  These  ingredients  together  were  the  basis  for  the 
analysis  and  the  synthesis  networks  and  for  several  appli¬ 
cations  which  we  have  already  mentioned  and  will  outline 
in  this  section. 

The  problem  of  estimating  pose  parameters  from  the 
image  of  a  3D  object  is  usually  solved  in  the  realm  of 
computer  vision  by  using  appropriate  3D  models  or  even 
physics-based models. For instance, pose and expression
of faces can be estimated by fitting a generic model
of a face to the image of a specific face (Aizawa, Harashima
and Saito, 1989; Poggio, 1991). Instead, we use
a short cut in the sense that we use a set of images as
examples and completely avoid the problem of a physics-based
model. Similarly, the classical synthesis problem
of  computer  graphics  is  the  problem  of  generating  novel 
images  corresponding  to  an  appropriate  set  of  control 
parameters  such  as  pose,  illumination  and  the  expres¬ 
sion  of  a  face.  The  traditional  approach  in  computer 
graphics  involves  3D  modeling  and  rendering  techniques 
which  effectively  simulate  the  physics  of  imaging  as  well 
as  rigid  and  non-rigid  body  motions.  For  example,  the 
interpretation  and  synthesis  of  facial  expressions  would 
classically follow from an understanding of facial muscle
groups in terms of their role in generating specific
expressions and their interdependency (cf. Ekman &
Friesen 1978; Ekman 1992; Essa 1993). Our alternative
approach  to  graphics  may  be  effective  in  the  sense  that  it 
may  provide  a  short-cut  by  trading  the  modeling  compo¬ 
nent  and  computation  with  memory  and  correspondence. 

Our  memory-based  approach  to  the  analysis  and  syn¬ 
thesis  problems  may  be  described  as  a  scheme  that  learns 
from  examples.  It  may  reflect  the  organization  of  human 
vision  for  object  recognition  (see  for  instance  Poggio, 
1990 and Poggio and Hurlbert, 1993). One may draw
the same connection in the special case of interpreting
facial expressions. Although the understanding of muscle
groups is the physically correct thing to do, it may not
be the way humans interpret facial expressions.

We  have  seen  (Section  5.3)  that  a  regularization  net¬ 
work  can  be  used  to  map  vector  fields  onto  vector  fields. 
Instead  of  using  an  analysis  and  a  synthesis  network  we 
can  map  images  onto  other  images  directly  —  by  ex¬ 
ploiting  the  observation  that  the  estimated  output  of  a 
synthesis  network  is  always  a  linear  combination  of  the 
example  images  (vectorized).  This  observation  makes  a 
direct  connection  to  two  other  results  in  the  context  of 
visual recognition: the linear combination of views (Ullman
& Basri, 1991) and the linear class of objects (Vetter
& Poggio, 1992). Consequently, the linear mapping between
vector fields is exact under certain conditions that
include  orthographic  projection,  and  either  the  group  of 
transformations  is  piece-wise  3D  affine,  or  the  example 
set  is  spanning  the  linear  class  of  objects  represented  by 
this  set.  From  a  practical  standpoint,  the  mapping  be¬ 
tween  vector  fields  introduces  a  scheme  for  synthesizing 
new  images  that  does  not  require  specifying  control  pa¬ 
rameters.  We  introduced  this  scheme  in  the  context  of 


teleconferencing  where  the  sender  solves  and  sends  the 
coefficients  of  the  linear  combination  required  to  span  the 
novel  image  from  the  example  set  of  images  (as  shown 
in  Appendix  B.l,  a  combination  of  eigenfunctions  of  the 
example  set  can  be  used  instead). 
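
The linear combination scheme for teleconferencing described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `sender_coefficients` and `receiver_reconstruct` are hypothetical, the "images" are random vectors standing in for vectorized example images, and ordinary least squares is assumed as the way the sender "solves" for the coefficients.

```python
import numpy as np

def sender_coefficients(Y, y_novel):
    """Sender side: least-squares coefficients b with Y @ b ~ y_novel.

    Y has one vectorized example per column (q x N); y_novel has length q.
    """
    b, *_ = np.linalg.lstsq(Y, y_novel, rcond=None)
    return b

def receiver_reconstruct(Y, b):
    """Receiver side: rebuild the novel vector from the shared examples."""
    return Y @ b

# Toy data (assumption): 4 random "example images" of dimension q = 200.
rng = np.random.default_rng(0)
q, N = 200, 4
Y = rng.normal(size=(q, N))
b_true = np.array([0.5, 0.3, 0.1, 0.1])
y_novel = Y @ b_true                    # a novel vector in the span of the examples

b = sender_coefficients(Y, y_novel)     # only these N numbers are transmitted
y_hat = receiver_reconstruct(Y, b)
```

Only the N coefficients cross the channel at run time; no pose or expression parameters are assigned to the examples in this variant.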

The results and schemes presented in this paper may
have several related applications. We have discussed
mainly two basic applications: analysis for gesture or
pose estimation, synthesis for computer graphics animation
and teleconferencing. Other potential applications
lie in the realm of virtual reality, training systems,
man-machine interfaces (including video speech synthesis),
computer-aided design and animation and object recognition.
For instance, the approach demonstrated with
our analysis network may be developed into a trainable
and rather universal control interface playing the role of
a computer mouse or a body suit or a computer glove,
depending on the available examples and sensors (Jones
and Poggio, in preparation). In another example, if the
synthesis module is trained on examples of a cartoon
character or of a different person (see the example set
of figure 7) from the analysis network, then one actor's
face can be used to "direct" a cartoon character or the
face of another person, as shown in figure 8. Variations
on this theme may have applications in various flavors
of virtual reality, videogames, and special effects.

A Regularization networks corresponding to bilinear splines

Consider x = (x, y) with 4 examples y_i, i = 1, ..., 4,
located at the corners of the unit square. If we use
G(x) = |x||y| as in figure 5, which corresponds to
interpolation by tensor product piece-wise linear splines,
then

G = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}

and

G^{+} = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}

Thus b_i = \sum_{j=1}^{4} (G^{+})_{i,j} g_j, which implies, denoting
by x_i the x, y coordinates of example i,

b_i(x) = G(x - x_{5-i})    (12)

with i = 1, ..., 4. For our choice of G(x) = |x||y|, equation
12 gives the bilinear coefficients

b_i(x) = |x - x_{5-i}| |y - y_{5-i}|.

These simple relations can be easily extended to more
than 2 dimensions, provided the examples are at the corners
of the unit hypercube in input space.

B The learning network and its dual representation

Let us assume that the input dimensionality is d, the
network has n centers with n ≤ N, the output dimensionality
is q and there are N examples in the training
set. This means that x is a d x 1 matrix, C is a q x n
matrix, Y is a q x N matrix and G is an n x N matrix.
Consider the equation

y(x) = Cg(x)    (13)

and its dual

y(x) = Yb(x)    (14)

Equation 13 can be regarded as a mapping from x to
y, whereas equation 14 may be regarded as a mapping
from the space of the b coefficients into y. The transformation
from a point x in X space to a point b in B
space is given by

b(x) = G^{+} g(x).    (15)

Given x, b is determined by the previous equation;
the converse is not in general possible.

Notice that when n = N each y is approximated by
a linear combination of the N examples. When n < N
this is still true but we can also consider the following
equation

y(x) = Cg(x)    (16)

which can be interpreted to mean that each y is approximated
by the linear combination of n vectors - the
columns of C - that are linear transformations of all the
N examples y with coefficients given by the elements of
g, that is

y = \sum_{i=1}^{n} g_i c_i    (17)

This can be compared with the previous equation 7:

y = \sum_{i=1}^{N} b_i y_i    (18)

B.1 K-L decomposition

Consider the previous equation

y = Yb(x) = \sum_{i=1}^{N} b_i(x) y_i    (19)


and expand each of the columns of Y - that is, each of
the examples y_i - in Q < N of the eigenfunctions of
YY^T, which has dimensionality q x q. Thus

y_i = \sum_{m=1}^{Q} d_{im} y'_m = Y' d_i    (20)

and combining the last two equations

y = \sum_{i=1}^{N} b_i(x) \sum_{m=1}^{Q} d_{im} y'_m = Y' b'(x)    (21)

where Y' is a q x Q matrix and b'(x) is a Q-dimensional
vector of the coefficients of the expansion of the vector
y in the eigenfunctions of YY^T, that is

b'(x) = Y'^T y(x)    (22)

where Y' is the matrix whose columns are the eigenvectors of
YY^T.

Thus each output y can be approximated by a linear
combination of eigenfunctions of the example set y_i. As
a consequence, a smaller set of example images may be
used in the linear combination technique than the number
of examples, provided they are chosen to be the most
significant eigenfunctions of the correlation matrix of the
example set and they are sufficient to approximate novel
images well.

Of course the eigenvectors of YY^T can be found
in terms of the eigenvectors of Y^T Y, which may have
a much lower dimensionality (for instance when the
columns of Y are images; see footnote 1).
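
As a hedged illustration of the last remark, the sketch below obtains eigenvectors of the large q x q matrix YY^T from the small N x N matrix Y^T Y, then applies equations 21 and 22. The function name `kl_basis` and the random test data are assumptions made for the example; it relies on the standard identity that if Y^T Y v = λ v with λ > 0, then Yv / sqrt(λ) is a unit eigenvector of YY^T with the same eigenvalue.

```python
import numpy as np

def kl_basis(Y, Q):
    """Top-Q orthonormal eigenvectors of Y @ Y.T, via the N x N matrix Y.T @ Y."""
    lam, V = np.linalg.eigh(Y.T @ Y)        # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:Q]       # keep the Q largest
    lam, V = lam[order], V[:, order]
    return (Y @ V) / np.sqrt(lam)           # columns play the role of Y'

# Toy data (assumption): N = 6 random "examples" of dimension q = 500.
rng = np.random.default_rng(1)
q, N, Q = 500, 6, 6
Y = rng.normal(size=(q, N))

Y_prime = kl_basis(Y, Q)
y = Y @ rng.normal(size=N)      # an output in the span of the examples
b_prime = Y_prime.T @ y         # equation 22
y_hat = Y_prime @ b_prime       # equation 21
```

With Q = N the reconstruction is exact for any y in the span of the examples; choosing Q < N keeps only the most significant eigenvectors and gives an approximation.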

C  A  comparison  of  the  two  approaches 
to  teleconferencing 

The first approach to teleconferencing exploits the following
decomposition (see equation 17), to be used at
the receiver site for reconstruction of the image y:

y = \sum_{i=1}^{n} g_i c_i    (23)

The alternative technique, based on the dual representation
of the networks, uses equation 7:

y = \sum_{i=1}^{N} b_i y_i    (24)

We have shown that the two equations are equivalent
since they are two ways of rewriting the same expression.
Reconstruction performance is therefore expected to be
the same. There are, however, somewhat different computational
tradeoffs offered by the two techniques. Let us
denote in this appendix with q the image dimensionality
and with d the dimensionality of the pose-expression
space. In the first technique (equation 23) the sender has
to send in batch mode and store (SBS) n d-dimensional
centers plus n x q parameters c, and send at run time
(SR) d numbers (the x vector). In the second technique
(equation 24) the sender has to send in batch
mode beforehand, to be memorized at the receiver site, N
q-dimensional y_i; it has to send at run time N numbers
b_i. In general d << N; assume for example N = 50,
n = 20, q = 10^5, d = 5. Then the first technique has
SBS = n x d + n x q ≈ 2 x 10^6 and SR = 5, while the second
technique has SBS = N x q = 5 x 10^6 and SR = 50. This
very rough estimate neglects precision issues.
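
The arithmetic in this estimate can be checked with a short sketch. The function names are hypothetical, and the counts are simply numbers of scalars, ignoring precision as the text does.

```python
def analysis_synthesis_cost(n, d, q):
    """First technique (equation 23): n centers of dimension d plus n x q
    parameters sent in batch mode, and d numbers sent at run time."""
    return {"SBS": n * d + n * q, "SR": d}

def linear_combination_cost(N, q):
    """Second technique (equation 24): N examples of dimension q sent in
    batch mode, and N coefficients sent at run time."""
    return {"SBS": N * q, "SR": N}

first = analysis_synthesis_cost(n=20, d=5, q=10**5)
second = linear_combination_cost(N=50, q=10**5)
```

For these values the first technique stores 2,000,100 numbers (about 2 x 10^6) and sends 5 per frame, while the second stores 5,000,000 and sends 50 per frame.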

Only experiments can decide the actual tradeoffs between
the two techniques, if any. In terms of power and
flexibility the analysis-synthesis technique seems superior:
the two modules ("sender" and "receiver") are independent
and could use different examples, in a different
number and in different positions in the space. This is
not so for the linear combination technique, which on the
other hand may be simpler in certain cases since it does
not require the assignment by the user of pose-expression
coordinates to the example images (which of course could
also be done automatically).

1 Notice that there is a connection between this formulation
and parametric eigenspace representation for visual
learning and recognition (Murase and Nayar, 1993). In particular,
our b'(x) is similar to their parametrized space for
pose. A significant difference is, however, that our approach
recognizes the critical importance of correspondence between
the images whereas theirs does not.

In the case of graphic applications the requirements
are somewhat different. Assume n = N (the vanilla
RBF case). In the case of the first technique, learning
requires the off-line computation and storage of the q,d
matrix C = YG^+; at run time the system computes
the d,1 vector g(x) and then y = Cg. In the case of the
second technique, learning requires the storage of the q,N
matrix Y and the computation and storage of the N,d
matrix G^+; at run time the system needs to compute
the d,1 vector g(x), the N,1 vector b = G^+ g and finally
y = Yb. The first technique requires fewer operations at
run time.
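
The two run-time procedures compared above can be sketched as follows, assuming n = N = 4 centers at the corners of the unit square and the kernel G(x) = |x||y| of Appendix A. Matrix shapes follow the definitions of Appendix B (C is q x n, g(x) has n components); the helper name `g` and the random example matrix are assumptions made for the illustration.

```python
import numpy as np

def g(x, centers):
    """Basis-function values G(x - t_i) for the kernel G(x) = |x||y|."""
    diff = np.abs(np.asarray(x, dtype=float) - centers)
    return diff[:, 0] * diff[:, 1]

# Centers coincide with the example locations (n = N); ordering is assumed.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rng = np.random.default_rng(2)
q = 300
Y = rng.normal(size=(q, len(centers)))      # columns are the examples y_i

G = np.stack([g(x_i, centers) for x_i in centers], axis=1)  # column i is g(x_i)
G_pinv = np.linalg.pinv(G)

# First technique: precompute C off-line, one matrix-vector product at run time.
C = Y @ G_pinv
x = np.array([0.3, 0.8])
y_first = C @ g(x, centers)

# Second technique: keep Y and G^+, two matrix-vector products at run time.
b = G_pinv @ g(x, centers)
y_second = Y @ b
```

Both routes give the same output; the first simply folds G^+ into C ahead of time, which is why it needs fewer operations per frame.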

D  Changing  reference  frames  for  the 
rend  operation 

During the rend operation,

img ← rend(y, img_tex, y_tex),

if the reference image img_ref and the example image
img_tex (used to sample facial texture) are not the same,
the facial geometry vector y must have its reference
frame changed to that of image img_tex. As shown in
figure 13, this is accomplished by

y' ← y - y_tex
y'' ← warp-vect(y', y_tex),

where the first step subtracts out the geometry of img_tex,
but still in the reference image of img_ref. The second
step uses a 2D warp to translate the correspondences to
the new reference image img_tex. Finally, the image is
rendered using a 2D warp from img_tex:

img ← warp(img_tex, y'').
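
A minimal sketch of these three steps is given below. It rests on assumptions not stated in the text: geometry vectors are stored as dense flow fields of shape (height, width, 2) holding (row, column) displacements relative to the reference image, and both warp and warp-vect are modeled as a nearest-pixel forward warp without hole filling. The helper name `forward_warp` is hypothetical, and the actual operators may differ.

```python
import numpy as np

def forward_warp(values, flow):
    """Push values[r, c] to the pixel nearest (r + flow[r, c, 0], c + flow[r, c, 1]).

    Destinations that receive no source pixel stay zero (no hole filling).
    """
    h, w = flow.shape[:2]
    out = np.zeros_like(values)
    rows, cols = np.mgrid[0:h, 0:w]
    dst_r = np.rint(rows + flow[..., 0]).astype(int)
    dst_c = np.rint(cols + flow[..., 1]).astype(int)
    ok = (dst_r >= 0) & (dst_r < h) & (dst_c >= 0) & (dst_c < w)
    out[dst_r[ok], dst_c[ok]] = values[rows[ok], cols[ok]]
    return out

def rend(y, img_tex, y_tex):
    y1 = y - y_tex                    # y'  = y - y_tex (still in the img_ref frame)
    y2 = forward_warp(y1, y_tex)      # y'' = warp-vect(y', y_tex) (now in the img_tex frame)
    return forward_warp(img_tex, y2)  # img = warp(img_tex, y'')

# Toy case: img_tex is the reference shifted right by 1 column, and the target
# geometry is the reference shifted right by 3 columns.
img_tex = np.zeros((6, 6))
img_tex[2, 2] = 1.0
y_tex = np.zeros((6, 6, 2))
y_tex[..., 1] = 1.0
y = np.zeros((6, 6, 2))
y[..., 1] = 3.0
img = rend(y, img_tex, y_tex)
```

In the toy case every texture pixel must move two columns to the right, so the bright pixel at (2, 2) in img_tex lands at (2, 4) in the rendered image.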

Figure 1: The most general Regularization Network with one hidden layer and vector output. In the analysis case
the  network  will  have  images  as  inputs  and  pose-expression  parameters  as  outputs;  in  the  synthesis  case  inputs  are 
pose  parameters  and  outputs  are  images. 


Figure  2:  In  our  demonstrations  of  the  example-based  approach  to  image  analysis  and  synthesis,  the  example  images 
img1  through  img4  are  placed  in  a  2D  rotation-expression  parameter  space  (x,  y),  here  at  the  corners  of  the  unit 
square.  For  analysis,  the  network  learns  the  mapping  from  images  (inputs)  to  parameter  space  (output).  For 
synthesis,  we  synthesize  a  network  that  learns  the  inverse  mapping,  that  is  the  mapping  from  the  parameter  space 
to  images. 


[Figure 3 panel title: Image Analysis]


Figure  3:  The  analysis  network  trained  on  the  four  examples  of  figure  2  estimates  the  two  pose/expression  parameters 
for  a  novel  image.  The  figure  shows  four  cases  of  novel  images  -  that  is  different  from  the  examples  in  the  training 
set  -  and  the  output  of  the  network  for  each  of  them. 

Figure 4: Three cases of 1D interpolation between the two example images bordered in black. In this case the
synthesis  network  has  one  input  only,  two  centers  (i.e.  hidden  units)  and  132  x  148  outputs,  with  each  output 
representing  the  x  or  the  y  coordinate  of  a  labeled  pixel.  The  third  case  is  an  amusing  failure  case  where  the  optical 
flow  algorithm  failed  to  find  correct  correspondences. 




Figure  5:  In  this  example  of  multidimensional  image  synthesis  the  input  variables  are  two  -  rotation  and  smile  -  while 
the  output  of  the  network  has  as  many  dimensions  as  twice  the  number  of  pixels  in  the  images  (since  it  represents 
the  x,  y  coordinates  of  each  ordered  pixel).  The  four  training  examples  -  the  same  shown  in  figure  2  -  are  singled  out 
by black borders. All other images are synthesized by a regularization network with G(x) = |x||y|, which effectively
performs  bilinear  interpolation,  after  “learning”  the  appropriate  parameters  from  the  4  examples. 
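
The bilinear interpolation mentioned in this caption can be sketched directly from the coefficients b_i(x) = |x - x_{5-i}| |y - y_{5-i}| of Appendix A. The corner ordering in `CORNERS` and the function names are assumptions made for the illustration; any ordering in which example 5 - i sits diagonally opposite example i would serve.

```python
import numpy as np

# Assumed ordering of the four example locations x_1..x_4 on the unit square.
CORNERS = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def bilinear_coefficients(x, y):
    """b_i(x) = |x - x_{5-i}| |y - y_{5-i}| for i = 1..4."""
    opposite = CORNERS[::-1]                  # row i holds the location x_{5-i}
    return np.abs(x - opposite[:, 0]) * np.abs(y - opposite[:, 1])

def synthesize(x, y, examples):
    """Linear combination sum_i b_i(x) y_i of the four example vectors.

    examples has shape (4, q): one vectorized example per row.
    """
    return np.tensordot(bilinear_coefficients(x, y), examples, axes=1)

# Toy example vectors (assumption): each example is a short vector.
examples = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]) * 10.0
center = synthesize(0.5, 0.5, examples)
corner = synthesize(1.0, 0.0, examples)
```

At a corner of the unit square the coefficients select the corresponding example exactly, and inside the square they are non-negative and sum to one.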

[Figure 6 panel title: Analysis and Synthesis. Parameter estimates printed in the figure: (x, y) = (0.497, 0.072); (x, y) = (0.670, 0.668); (x, y) = (0.152, 0.810); a fourth estimate is not legible.]

Figure  6:  In  each  of  the  boxed  image  pairs,  the  novel  input  image  on  the  left  is  fed  into  an  analysis  RBF  network  to 
estimate  rotation  x  and  expression  y  (as  shown  in  figure  3).  These  parameters  are  then  fed  into  the  synthesis  module 
of  figure  5  that  synthesizes  the  image  shown  on  the  right.  This  figure  can  be  regarded  as  a  very  simple  demonstration 
of very-low-bandwidth teleconferencing: only two pose parameters need to be transmitted at run-time for each frame.


Figure 7: Another set of example images that the synthesis network uses to synthesize a face different from the face
used  by  the  analysis  network  to  estimate  rotation  and  expression. 

[Figure 8 panel title: Directing Another Person. Parameter estimates printed in the figure: (x, y) = (0.958, 0.056); (x, y) = (0.197, 0.072); (x, y) = (0.070, 0.668); a fourth estimate is not legible.]


Figure 8: "Directing" another person: in each of the boxed image pairs, the input image on the left is fed to an
analysis RBF network to estimate rotation x and expression y. These parameters are then fed into the synthesis
network trained with the examples (see figure 7) of another person, which synthesizes the image shown on the right.
In this way the "actor" on the left effectively "directs" the person of figure 7.


[Figure 9 panel title: Example Set 1]


Figure 9: The set of example images at the sender and at the receiver site used by the linear combination technique
demonstrated  in  the  next  figure  (same  images  as  in  figure  2).  In  this  case  a  pose  value  does  not  need  to  be  assigned 
to  each  image. 

[Figure 10 panel title: Example Transmitter/Receiver Pairs. Coefficients printed in the figure:]

X: 0.565 -0.060 -0.104
Y: 0.290 0.016 0.107

X: 0.019 0.539 0.095
Y: 0.111 0.880 0.083


Figure  10:  A  novel  image  (left)  is  decomposed  in  terms  of  a  linear  combination  of  the  example  images  of  the  previous 
figure and reconstructed using the same example images at the receiver site. Separate b_i coefficients are listed for
the  x  and  y  components  of  the  vectorized  image  since  they  are  decomposed  separately  using  equation  (9). 


[Figure 11 panel title: Different Transmitter/Receiver Pairs. Coefficients printed in the figure:]

X: 0.565 -0.060 -0.104
Y: 0.290 0.016 0.107

X: 1.379 0.266 -0.102
Y: 1.129 0.259 -0.110

X: 0.019 0.539 0.095
Y: 0.111 0.880 0.083


Figure  11:  A  novel  image  (left)  is  decomposed  in  terms  of  a  linear  combination  of  the  example  images  of  figure  9 
and  reconstructed  at  the  receiver  site  combining  with  the  same  coefficients  the  example  views  shown  in  figure  7  of  a 
different  person. 

[Figure 12 legend: p: prototype; nov: novel input; y: vectorized image. Image labels: img_p, img_{p+nov}.]


Figure  12:  A  face  transformation  is  “learned”  from  a  prototypical  example  transformation.  Here,  face  rotation  and 
smiling transformations are represented by prototypes, y_p. y_p is mapped onto the new face image img_nov by using
the correspondences specified by the flow y_nov. The image img_{p+nov} is synthesized by the system.


Figure  13:  Changing  the  reference  frame  during  the  rend  operation.  When  the  facial  geometry  y  is  rendered  using  the 
image texture from img_tex - which is not the reference image - we must compute y relative to img_tex, producing y''.
Here, point P corresponds with P_ref and P_tex in the reference image img_ref and texture image img_tex, respectively.